Exploratory analysis of the Algae dataset
We load the data and take a quick first look at it:
algas <- read.table(file = "/home/jared/Proyectos/itam-dm/data/algas/algas.txt",
                    header = FALSE,
                    dec = ".",
                    col.names = c('temporada', 'tamaño', 'velocidad', 'mxPH',
                                  'mnO2', 'Cl', 'NO3', 'NO4', 'oPO4', 'PO4',
                                  'Chla', 'a1', 'a2', 'a3', 'a4', 'a5', 'a6', 'a7'),
                    na.strings = c('XXXXXXX')
)
head(algas)
##   temporada tamaño velocidad mxPH mnO2     Cl    NO3     NO4    oPO4
## 1    winter  small    medium 8.00  9.8 60.800  6.238 578.000 105.000
## 2    spring  small    medium 8.35  8.0 57.750  1.288 370.000 428.750
## 3    autumn  small    medium 8.10 11.4 40.020  5.330 346.667 125.667
## 4    spring  small    medium 8.07  4.8 77.364  2.302  98.182  61.182
## 5    autumn  small    medium 8.06  9.0 55.350 10.416 233.700  58.222
## 6    winter  small      high 8.25 13.1 65.750  9.248 430.000  18.250
##       PO4 Chla   a1   a2   a3  a4   a5   a6  a7
## 1 170.000 50.0  0.0  0.0  0.0 0.0 34.2  8.3 0.0
## 2 558.750  1.3  1.4  7.6  4.8 1.9  6.7  0.0 2.1
## 3 187.057 15.6  3.3 53.6  1.9 0.0  0.0  0.0 9.7
## 4 138.700  1.4  3.1 41.0 18.9 0.0  1.4  0.0 1.4
## 5  97.580 10.5  9.2  2.9  7.5 0.0  7.5  4.1 1.0
## 6  56.667 28.4 15.1 14.6  1.4 0.0 22.5 12.6 2.9
library(Hmisc)  # describe() comes from the Hmisc package
describe(algas)
## algas
##
## 18 Variables 200 Observations
## ---------------------------------------------------------------------------
## temporada
## n missing unique
## 200 0 4
##
## autumn (40, 20%), spring (53, 26%), summer (45, 22%)
## winter (62, 31%)
## ---------------------------------------------------------------------------
## tamaño
## n missing unique
## 200 0 3
##
## large (45, 22%), medium (84, 42%), small (71, 36%)
## ---------------------------------------------------------------------------
## velocidad
## n missing unique
## 200 0 3
##
## high (84, 42%), low (33, 16%), medium (83, 42%)
## ---------------------------------------------------------------------------
## mxPH
## n missing unique Info Mean .05 .10 .25 .50
## 199 1 72 1 8.012 7.081 7.340 7.700 8.060
## .75 .90 .95
## 8.400 8.700 8.873
##
## lowest : 5.60 5.70 6.40 6.50 6.60, highest: 9.00 9.06 9.10 9.50 9.70
## ---------------------------------------------------------------------------
## mnO2
## n missing unique Info Mean .05 .10 .25 .50
## 198 2 88 1 9.118 4.485 5.770 7.725 9.800
## .75 .90 .95
## 10.800 11.700 11.815
##
## lowest : 1.5 1.8 3.2 3.3 3.4, highest: 12.5 12.6 12.9 13.1 13.4
## ---------------------------------------------------------------------------
## Cl
## n missing unique Info Mean .05 .10 .25 .50
## 190 10 178 1 43.64 3.061 4.970 10.981 32.730
## .75 .90 .95
## 57.823 88.600 130.087
##
## lowest : 0.222 0.800 1.170 1.450 1.549
## highest: 173.750 187.183 194.750 208.364 391.500
## ---------------------------------------------------------------------------
## NO3
## n missing unique Info Mean .05 .10 .25 .50
## 198 2 192 1 3.282 0.4023 0.6912 1.2960 2.6750
## .75 .90 .95
## 4.4463 6.1916 7.9369
##
## lowest : 0.050 0.102 0.130 0.230 0.267
## highest: 9.248 9.715 9.773 10.416 45.650
## ---------------------------------------------------------------------------
## NO4
## n missing unique Info Mean .05 .10 .25 .50
## 198 2 179 1 501.3 10.00 15.00 38.33 103.17
## .75 .90 .95
## 226.95 805.33 1922.87
##
## lowest : 5.0 5.8 8.0 10.0 10.5
## highest: 4073.3 5738.3 6400.0 8777.6 24064.0
## ---------------------------------------------------------------------------
## oPO4
## n missing unique Info Mean .05 .10 .25 .50
## 198 2 173 1 73.59 2.00 3.94 15.70 40.15
## .75 .90 .95
## 99.33 193.21 248.34
##
## lowest : 1.000 1.250 1.333 1.625 1.800
## highest: 346.167 412.333 428.750 467.500 564.600
## ---------------------------------------------------------------------------
## PO4
## n missing unique Info Mean .05 .10 .25 .50
## 198 2 189 1 137.9 6.455 11.350 41.375 103.285
## .75 .90 .95
## 213.750 286.100 345.650
##
## lowest : 1.0 2.5 3.0 4.0 6.0
## highest: 558.8 586.0 607.2 624.7 771.6
## ---------------------------------------------------------------------------
## Chla
## n missing unique Info Mean .05 .10 .25 .50
## 188 12 131 1 13.97 0.500 0.800 2.000 5.475
## .75 .90 .95
## 18.308 31.817 61.733
##
## lowest : 0.20 0.30 0.40 0.50 0.60
## highest: 88.25 92.67 93.68 98.82 110.46
## ---------------------------------------------------------------------------
## a1
## n missing unique Info Mean .05 .10 .25 .50
## 200 0 121 0.99 16.92 0.00 0.00 1.50 6.95
## .75 .90 .95
## 24.80 50.72 64.33
##
## lowest : 0.0 1.1 1.2 1.4 1.5, highest: 75.8 81.9 82.7 86.6 89.8
## ---------------------------------------------------------------------------
## a2
## n missing unique Info Mean .05 .10 .25 .50
## 200 0 89 0.95 7.458 0.00 0.00 0.00 3.00
## .75 .90 .95
## 11.38 21.50 28.38
##
## lowest : 0.0 1.0 1.2 1.4 1.5, highest: 40.7 40.9 41.0 53.6 72.6
## ---------------------------------------------------------------------------
## a3
## n missing unique Info Mean .05 .10 .25 .50
## 200 0 79 0.95 4.309 0.000 0.000 0.000 1.550
## .75 .90 .95
## 4.925 13.510 20.275
##
## lowest : 0.0 1.0 1.1 1.2 1.4, highest: 24.8 25.3 25.9 35.1 42.8
## ---------------------------------------------------------------------------
## a4
## n missing unique Info Mean .05 .10 .25 .50
## 200 0 50 0.84 1.992 0.000 0.000 0.000 0.000
## .75 .90 .95
## 2.400 5.000 7.605
##
## lowest : 0.0 1.0 1.1 1.2 1.3, highest: 11.5 12.7 13.4 28.8 44.6
## ---------------------------------------------------------------------------
## a5
## n missing unique Info Mean .05 .10 .25 .50
## 200 0 81 0.94 5.064 0.00 0.00 0.00 1.90
## .75 .90 .95
## 7.50 14.91 20.04
##
## lowest : 0.0 1.0 1.1 1.2 1.4, highest: 28.8 34.2 34.3 35.6 44.4
## ---------------------------------------------------------------------------
## a6
## n missing unique Info Mean .05 .10 .25 .50
## 200 0 76 0.85 5.964 0.000 0.000 0.000 0.000
## .75 .90 .95
## 6.925 17.110 31.815
##
## lowest : 0.0 1.0 1.2 1.4 1.5, highest: 42.7 49.4 52.5 64.6 77.6
## ---------------------------------------------------------------------------
## a7
## n missing unique Info Mean .05 .10 .25 .50
## 200 0 51 0.88 2.496 0.00 0.00 0.00 1.00
## .75 .90 .95
## 2.40 6.10 10.88
##
## lowest : 0.0 1.0 1.1 1.2 1.4, highest: 22.1 25.6 30.1 31.2 31.6
## ---------------------------------------------------------------------------
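The `missing` counts that describe() reports variable by variable can also be collected into a single vector. The following is only a minimal sketch on a made-up data frame: `toy`, its values, and the name `missing_counts` are illustrative assumptions, not part of algas.txt.

```r
# Toy data frame standing in for `algas` (values are made up for illustration)
toy <- data.frame(
  temporada = c("winter", "spring", "autumn", "spring"),
  mxPH      = c(8.00, NA, 8.10, 8.07),
  Cl        = c(60.8, NA, NA, 77.364),
  stringsAsFactors = FALSE
)

# Number of missing values per column, largest first
missing_counts <- sort(colSums(is.na(toy)), decreasing = TRUE)
print(missing_counts)

# The same information as a percentage of the rows
print(round(100 * missing_counts / nrow(toy), 1))
```

Applied to the real data, `colSums(is.na(algas))` should agree with the `missing` column in the output above, for example 12 for Chla and 10 for Cl.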
Now, using the function we created in utils.r, we explore the data visually to better understand how each variable is distributed.
for (i in seq_along(algas)) {
  print(graf_expl(algas, names(algas)[i]))
}
Now let's look at the relationships between the variables, two at a time.
for (i in 1:(ncol(algas) - 1)) {
  for (e in (i + 1):ncol(algas)) {
    print(graf_expl2(algas, names(algas)[i], names(algas)[e]))
  }
}
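The nested loop visits every unordered pair of columns exactly once. As a sketch, the same enumeration can be produced with `combn()`, which also makes it easy to see how many plots the loop generates. The column names are copied from the `col.names` argument used when reading the file; `columnas` and `pares` are illustrative names.

```r
# Column names as passed to read.table() in this document
columnas <- c('temporada', 'tamaño', 'velocidad', 'mxPH',
              'mnO2', 'Cl', 'NO3', 'NO4', 'oPO4', 'PO4',
              'Chla', 'a1', 'a2', 'a3', 'a4', 'a5', 'a6', 'a7')

# Each column of `pares` is one pair, in the same order the nested loop visits them
pares <- combn(columnas, 2)

ncol(pares)   # choose(18, 2) = 153 pairs
pares[, 1]    # first pair: "temporada" "tamaño"
```

With 153 plots it may be worth restricting the loop to a subset of pairs, for instance each predictor against one of the targets a1 to a7.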
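Here we look at how the missing values relate to each other. There is a clear relationship between the missing values of Chla and Cl, and a weaker one among the variables Cl, NO3, NO4, oPO4, PO4 and Chla.
library(corrplot)  # corrplot() comes from the corrplot package
# Correlation between the missingness indicators (TRUE/FALSE) of every pair of columns
corr_na <- cor(is.na(algas))
# Columns without any NA have zero variance, so cor() returns NA for them; show those as 0
corr_na[is.na(corr_na)] <- 0
corrplot(corr_na)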
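To make the idea behind this plot concrete, here is a small worked example on made-up data (the data frame `toy` and its columns `a`, `b`, `c` are assumptions for illustration). `is.na()` turns the data frame into a logical matrix, and `cor()` treats TRUE and FALSE as 1 and 0, so a high correlation means that two variables tend to be missing in the same rows.

```r
# Toy data: `a` and `b` share missing rows, `c` is complete
toy <- data.frame(
  a = c(1, NA, 3, NA, 5, 6),
  b = c(NA, NA, 3, NA, 5, 6),
  c = c(1, 2, 3, 4, 5, 6)
)

# Logical matrix: TRUE where the value is missing
ind <- is.na(toy)

# cor() warns about the zero-variance column `c`; the warning is expected here
corr <- suppressWarnings(cor(ind))

# Entries involving the complete column are NA; replace them with 0 as in the text
corr[is.na(corr)] <- 0

print(round(corr, 3))
```

In this toy case the correlation between the indicators of `a` and `b` is 1/sqrt(2), about 0.707, while every entry involving `c` ends up as 0, including its diagonal entry.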
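We write a function that replaces the NAs with the mean when the variable is numeric and with the mode when it is categorical.
sust_na <- function(col) {
  if (is.numeric(col)) {
    # Numeric column: fill NAs with the mean of the observed values
    media <- mean(col, na.rm = TRUE)
    col[is.na(col)] <- media
  } else {
    # Categorical column: fill NAs with the most frequent value.
    # Note: unique() keeps NA, so NA itself can win if it is the most frequent value.
    Mode <- function(x) {
      ux <- unique(x)
      ux[which.max(tabulate(match(x, ux)))]
    }
    moda <- Mode(col)
    col[is.na(col)] <- moda
  }
  col
}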
Now we apply this function to every column:
algas_2 <- apply(algas, 2, sust_na)
We check whether any NAs remain:
corr_na_2 <- cor(is.na(algas_2))
corr_na_2[is.na(corr_na_2)] <- 0
corrplot(corr_na_2)
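Two columns were still left with NAs, so we repeat the substitution with a for loop. A likely cause is that apply() first converts the data frame to a character matrix, so every column goes through the mode branch, and in a column where NA is the most frequent value the computed mode is itself NA.
algas_3 <- algas
for (i in seq_along(algas)) {
  algas_3[, i] <- sust_na(algas[, i])
}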
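The loop above works because each column reaches the function with its original type. The sketch below shows the same idea with `lapply()` on made-up data, together with a variant of the fill function that computes the mode on observed values only. The names `sust_na_safe`, `toy`, `as_matrix` and `toy_filled` are hypothetical and not part of the original analysis.

```r
# Hypothetical NA-safe variant: both the mean and the mode ignore missing values
sust_na_safe <- function(col) {
  observed <- col[!is.na(col)]
  if (is.numeric(col)) {
    fill <- mean(observed)
  } else {
    ux <- unique(observed)
    fill <- ux[which.max(tabulate(match(observed, ux)))]
  }
  col[is.na(col)] <- fill
  col
}

# Toy data: NA is the most frequent entry in `velocidad`
toy <- data.frame(
  velocidad = c("high", NA, "high", "low", NA, NA),
  Cl        = c(10, NA, 30, NA, NA, 20),
  stringsAsFactors = FALSE
)

# apply() converts the data frame to a character matrix before calling the function
as_matrix <- apply(toy, 2, identity)
is.character(as_matrix)

# lapply() passes each column with its own type; `[]` keeps the data frame structure
toy_filled <- toy
toy_filled[] <- lapply(toy, sust_na_safe)
print(toy_filled)
```

Here the missing `velocidad` entries become "high" and the missing `Cl` entries become 20, the mean of 10, 30 and 20, while `Cl` stays numeric.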
Now we see that the NAs were indeed removed, which concludes our analysis of the missing values.
corr_na_3 <- cor(is.na(algas_3))
corr_na_3[is.na(corr_na_3)] <- 0
corrplot(corr_na_3)
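A correlation plot made only of zeros is an indirect way to confirm that nothing is missing. As a complement, a direct numeric check could look like the sketch below; `before` and `after` are made-up data frames used only to illustrate the check.

```r
# Made-up data frames: one with a missing value, one already filled
before <- data.frame(x = c(1, NA, 3), y = c("a", "b", "c"))
after  <- data.frame(x = c(1, 2, 3),  y = c("a", "b", "c"))

# Total number of missing cells in each data frame
n_before <- sum(is.na(before))
n_after  <- sum(is.na(after))
print(c(before = n_before, after = n_after))

# Stop with an error if any NA survived the substitution
stopifnot(n_after == 0)
```

On the real data the equivalent check would be `sum(is.na(algas_3)) == 0`.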